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Abstract —We investigate the problem of estimating the 3D shape of an object defined by a set of 3D landmarks, given their 2D 
correspondences in a single image. A successful approach to alleviating the reconstruction ambiguity is the 3D deformable shape 
model and a sparse representation is often used to capture complex shape variability. But the model inference is still challenging due to 
the nonconvexity in the joint optimization of shape and viewpoint. In contrast to prior work that relies on an alternating scheme whose 
solution depends on initialization, we propose a convex approach to addressing this challenge and develop an efficient algorithm to 
o solve the proposed convex program. We further propose a robust model to handle gross errors in the 2D correspondences. We 

H demonstrate the exact recovery property of the proposed method, the advantage compared to several nonconvex baselines and the 

applicability to recover 3D human poses and car models from single images. 
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Index Terms —3D reconstruction, sparse representation, convex optimization. 



r4-i Introduction 

> 

Recognizing 3D objects from 2D images is a central problem 
m computer vision. Past years have witnessed an emerging 

J end towards analyzing 3D geometry of objects including 
iapes and poses instead of merely providing bounding 
boxes [1], [2]. The 3D geometric reasoning can not only 
ymrovide richer information about the scene for subsequent 
high-level tasks such as scene understanding, augmented 
ality and human computer interaction, but also improve 
recognition performances [3], [4]. 3D reconstruction has 
Tureen a well studied problem and there have been many 
(practically applicable techniques such as structure from 
Q^otion, multi-view stereo and depth sensors, but these 
fTbchniques are limited in some scenarios. For example, a 
l/T&rge number of acquisitions from multiple views are often 
r T^quired in order to obtain a complete 3D model, which 
Jjs not preferred in some real-time applications; the depth 
^jkisors in general cannot work outdoor and have a limited 
rsensing range; and reconstructing a dynamic scene is still an 
open problem. In this paper, we aim to investigate the pos¬ 
sibility of estimating the 3D shape of an object from a single 
2D image, which is complementary to the aforementioned 
techniques and may potentially address the above issues. 

Estimating the 3D geometry of an object from a single 
view is an ill-posed problem. But it is a possible task for a 
human observer, since human can leverage visual memory 
of object shapes. Inspired by this idea, much effort has 
been made towards 3D model-based analysis leveraging the 
increasing availability of online 3D model databases, such 
as the Google 3D warehouse that includes millions of CAD 
models of various objects and many publicly available shape 
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scans and motion capture data that model the 3D shape and 
pose of human. 

To address intra-class variability and nonrigid deforma¬ 
tion and avoid exhaustively enumerating all possibilities, 
many previous works, e.g. [5], [6], adopted a 3D deformable 
model originated from the "active shape model" [7] to repre¬ 
sent shapes, where each shape is defined by a set of ordered 
landmarks and the one to be estimated is assumed as a lin¬ 
ear combination of some basis shapes usually obtained from 
principal component analysis (PCA). For human poses, 
the sparse representation was proposed to handle highly 
articulated deformation of human bodies which cannot be 
captured by PC A [8], [9]. In model inference, the 3D de¬ 
formable model is matched to the landmarks annotated or 
detected in images, and the problem is reduced to a 3D- 
to-2D shape fitting problem where the shape (weights of 
the linear combination) and the viewpoint (camera extrinsic 
parameters) have to be estimated simultaneously. Figure 1 
gives an illustration of the problem. 

While this approach has achieved promising results in 
various applications, the model inference is still a challeng¬ 
ing problem, since the subproblems of shape and viewpoint 
estimation are coupled: the viewpoint needs to be known 
in order to fit the 3D model to 2D, and inversely, the 3D 
model needs to be known in order to estimate the view¬ 
point. The joint estimation of shape and viewpoint results 
in a nonconvex optimization problem, and the orthogonal¬ 
ity constraint on the camera rotation makes the problem 
even more complicated. Previous methods often rely on 
an alternating scheme to alternately update the shape and 
viewpoint parameters, which has no guarantee for global 
convergence and is sensitive to initialization. As mentioned 
in many previous works, e.g. [8], [5], most of the failed 
cases were attributed to bad initialization. Some heuristics 
have been proposed to address this issue, such as initializing 
multiple times [9] or using a viewpoint-aware detector [6]. 
But there is still no guarantee for global optimality. 

In this paper, we propose a convex relaxation approach 
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Fig. 1. Illustration of the problem studied in this paper. The unknown 3D model is defined by a set of landmarks and assumed to be a linear 
combination of some predefined basis shapes with sparse coefficients. Given the 2D correspondences of the landmarks in a single image, the 
computational problem is to simultaneously estimate the coefficients of the sparse representation as well as the viewpoint of the camera. 


to addressing the aforementioned issue. We use an aug¬ 
mented shape-space model, in which a shape is represented 
as a linear combination of rotatable basis shapes. This model 
gives a linear representation of both intrinsic shape defor¬ 
mation and extrinsic viewpoint changes. Next, we use the 
convex relaxation of the orthogonality constraint to convert 
the entire problem into a spectral-norm regularized least 
squares problem, which is a convex program. Then, we 
develop an efficient algorithm to solve the proposed convex 
program based on the alternating direction method of mul¬ 
tipliers (ADMM). To achieve fast computation, we derive 
a closed-form solution to compute the proximal operator 
of the spectral norm. Furthermore, we extend the proposed 
model to handle outliers in the input 2D correspondences. 
This work extends its earlier version [10] with the exten¬ 
sions including postprocessing for rotation synchronization, 
outlier modeling, more experiments and real examples with 
learned detectors. 

The remainder of this paper is organized as follows. We 
summarize related work in Section 2, formulate the problem 
in Section 3, introduce the proposed method and technical 
details in Section 4, report empirical results in Section 5, and 
finally conclude the paper in Section 6. The code is available 
at http://cis.upenn.edu/~xiaowz/shapeconvex.html. 

2 Related work 

The most related work includes the ones that fit a 3D 
deformable model to the features in a 2D image. A popular 
application is human face modeling for various tasks such 
as recognition [1 ], feature tracking [12] and facial animation 
[1 ]. Recently, there has been an increased number of works 
on 3D object modeling. For example, Hejrati and Ramanan 
[5] used the deformable model for 3D car modeling. They 
produced 2D landmarks by a variant of deformable part 
models [14] and then lifted the 2D model to 3D. Lin et 
al. [1 ] proposed a method for joint 3D model fitting and 
fine-grained classification for cars. Zia [6] et al. developed 
a probabilistic framework to simultaneously localize 2D 
landmarks and recover 3D object models. 

While the conventional active shape model performs 
well to describe the shape variability of rigid objects, it can 
hardly handle the structural variability of nonrigid objects 
such as human bodies. To address this issue, Ramakrishna 
et al. [I ] proposed a sparse representation based approach 
to reconstructing 3D human pose from annotated joints in 
a still image. Wang et al. [9] adopted a 2D human pose 
detector to automatically locate the joints and used a robust 
estimator to handle inaccurate joint locations. Fan et al. [16] 
proposed to improve the performance of [ 8 ] by enforcing 


locality when building the pose dictionary. Zhou and De La 
Torre [17] formulated human pose estimation as a matching 
problem, where a learned spatio-temporal pose model was 
matched to point trajectories extracted from a video. Akhter 
and Black [18] integrated a joint-angle constraint into the 
sparse representation to reduce the possibility of invalid 
reconstruction. 

Recently, there was growing interest in reconstructing 
category-specific object models from a collection of single 
images of different instances with the same category [19], 
[1 ], [21], [2 ]. The deformable shape model was adopted 
and fitted to either visual hulls in the pre-segmented images 
or landmarks annotated in the images, and at the same time 
the basis shapes were also learned from images during the 
model inference. The problem studied in this paper can be 
regarded as a subproblem in this process where the basis 
shapes are given. 

A common component or an intermediate step in the 
works mentioned above is to fit a 3D deformable model to 
2D correspondences. As we mentioned in the introduction, 
these works relied on nonconvex optimization, which may 
be sensitive to initialization. The convex formulation pro¬ 
posed in this paper can potentially serve as a building block 
or provide a good initialization to improve the performance 
of the existing methods. 

Another line of work tried to solve single-image re¬ 
construction by finding the nearest neighbor in the shape 
collection followed by a refinement step [23], [2 ]. The 
initial shape and viewpoint were found by enumerating 
all instances and viewpoints and comparing the test image 
with the rendered one. The initial estimate was refined by 
optimizing both the viewpoint and nonrigid deformation 
according to image contours. This approach produces very 
detailed reconstruction and is applicable to a wide range of 
object categories, but it is computationally expensive and 
requires accurate image segmentation. 

There are also many alternative approaches for 3D 
human pose recovery from single images such as using 
a known articulated skeleton [25], [26], [27], probabilistic 
graphical models [28], [2 ], explicit regression [30], [31], to 
name a few. But these approaches are customized for human 
pose modeling and not straightforwardly generalizable for 
other objects. 

Our work is also closely related to nonrigid struc¬ 
ture from motion (NRSfM), where a deformable shape is 
recovered from multi-frame 2D-2D correspondences. The 
low-rank shape-space model has been frequently used in 
NRSfM, but the basis shapes are unknown. The joint estima¬ 
tion of shape/pose variables and basis shapes was typically 



3 


solved via matrix factorization followed by metric rectifi¬ 
cation [32], [33]. In some recent works, iterative algorithms 
were employed for better precision [34], [35] or sequential 
processing [36], and the problem studied in this paper is 
analogous to the step of fixing basis shapes and updating 
the remaining variables in those works. 


3 Problem statement 

The problem studied in this paper can be described by the 
following linear system: 


W = ITS, 


(1) 


where S G R 3xp denotes the unknown 3D shape, which 
is represented by 3D locations of p points. W G R 2xp 
denotes their projections in a 2D image. II is the cam¬ 
era calibration matrix. To simplify the problem, the weak- 
perspective camera model is usually used, which is a good 
approximation when the object depth is much smaller than 
the distance from the camera. With this assumption, the 
calibration matrix has the following simple form: 


n = 


s 0 
0 s 


0 " 

0 ’ 


( 2 ) 


where s is a scalar depending on the focal length and the 
distance to the object. 

There are always more variables than equations in (1). To 
make the problem well-posed, a widely-used assumption 
is that the unknown shape can be represented as a linear 
combination of predefined basis shapes, which is originated 
from the active shape model [7]: 


k 

S = Y,*Bi, (3) 

2=1 

where Bi G R 3xp for i G [1, k\ represents a basis shape and 
Ci its weight in the combination. Thus, the reconstruction 
problem is reduced to a problem of estimating several 
coefficients by fitting the model (3) to the landmarks in an 
image, which greatly reduces the number of unknowns. 

Since the basis shapes are predefined, the relative ro¬ 
tation and translation between the camera frame and the 
coordinates that define the basis shapes need to be taken 
into account, and the 3D-2D projection is depicted by: 


W = n^Rj2 c iBi+Tl T j , (4) 

where R £ R 3x3 and T £ M 3 correspond to the rotation 
matrix and the translation vector, respectively. R should be 
in the special orthogonal group 

50(3) = {Re R 3 x 3 \R t R = I 3 ,detR = 1}. (5) 


Equation (4) can be further simplified as 

k 

W = RY J c i B u (6) 

2=1 

where R G R 2x3 denotes the first two rows of the rotation 
matrix, and the translation T has been eliminated by cen¬ 
tralizing the data, i.e. subtracting each row of W and B by 


its mean. Note that the scalar s in the calibration matrix has 
been absorbed into ci, • • • , c&. 

In the active shape model [7], the principal components 
of training samples are used as the basis shapes, which as¬ 
sumes that all shapes lie in a low-dimensional linear space. 
In more recent work, e.g. [8], [37], [38], [39], it has been 
shown that the low-dimensional linear space is insufficient 
to model complex shape variation, e.g., human poses, and 
a promising approach is to use an over-complete dictionary 
and represent an unknown shape as a sparse combination 
of atoms in the dictionary. Such a sparse representation 
implicitly encodes the assumption that the unknown shape 
should lie in a union of subspaces that approximates a 
nonlinear shape manifold. The rep resentabili ties of PC A and 
sparse representation for 3D human pose modeling will be 
empirically compared in Section 5.2. 

Based on the sparse representation of shapes, the fol¬ 
lowing type of optimization problem is often considered to 
estimate an unknown shape: 


min 

c,R 

S.t. 


W - Rj2 c iB, 


2=1 


RR T = 12, 


“ c \\u 


( 7 ) 


where c = [ci,**- , Ck] T and ||c||i represents the t\ norm 
of c, which is the convex surrogate of the cardinality. || • \\f 
denotes the Frobenius norm of a matrix. The terms in the 
loss function of (7) correspond to the reprojection error and 
the sparsity of representation, respectively. 

The optimization in (7) is nonconvex and there is an or¬ 
thogonality constraint. A commonly-used strategy to solve 
the optimization is the alternating minimization scheme, in 
which two steps are alternated: (1) fixing R and updating 
c by any t\ minimization solver, e.g., the methods summa¬ 
rized in [40]; (2) fixing c and updating R using certain rota¬ 
tion representations such as the quaternions, the exponential 
map or a manifold representation. In some previous works 
[8], [18], R is updated by the singular value decomposition 
(SVD), and it is worth noting that this method can only yield 
an approximate solution since R is not a square matrix and 

— T — 

R R ^ I. Generally no closed-form solution exists in this 
case [41]. 

The alternating algorithm is summarized in Algorithm 1. 
Due to the nonconvexity of (7), Algorithm 1 may get stuck 
at local minima when initialization is far away from the true 
solution. 


Input: W 
Output: c and R. 

1 initialize c and R ; 

2 while not converged do 

3 update c using any i\ minimization solver; 

4 update R using SVD or any local optimization 
solver over SO (3); 

5 end 

Algorithm 1: Alternating minimization to solve (7). 
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4 Proposed methods 
4.1 Convex relaxation 

We propose to use the following shape-space model: 

k 

S = Y,*RiBi, ( 8 ) 

in which there is a rotation for each basis shape. The model 

in (8) implicitly accounts for the viewpoint variability and 
the projected 2D model is 

k k 

W = Uj2c i R i B i = J2M i B i , (9) 

i= 1 i= 1 

where Mi E M 2x3 is the product of q and the first two rows 
of Ri, which satisfies 

MiMj = c?J 2 . (10) 


The motivation of using the model in (9) is to achieve a 
linear representation of shape variability in 2D, such that we 
can get rid of the bilinear form in (6), which is a necessary 
step towards a convex formulation. 

The model in (9) is equivalent to the affine-shape model 
in literature [42], [43], which uses an augmented linear space 
to represent the shape variation in 2D caused by both in¬ 
trinsic shape deformation and extrinsic viewpoint changes. 
This representation also appears in most NRSfM literature, 
e.g. [32], [34]. As mentioned in [43], the augmented linear 
space can represent any 2D shape produced by the 3D shape 
model projected into the image plane, but the increased 
degree of freedom may result in invalid shapes. In this 
work, we try to reduce the possibility of invalid cases 
by enforcing the orthogonality constraint on ilT^s and the 
sparsity constraint on the number of activated basis shapes. 
We will show that these constraints can be conveniently 
imposed by minimizing a convex regularizer. 

Next, replacing the orthogonality constraint in (10) by its 
convex counterpart is considered. The following lemma has 
been proven in literature [44, Section 3.4]: 

Lemma 1. The convex hull of the Stiefel manifold Q = 
jx E R mxn | X T X = J n | equals the unit spectral-norm ball 
conv(Q) = {X EM mxn | ||X|| 2 < 1}. ||X|| 2 denotes the 
spectral norm (a.k.a. the induced 2-norm) of a matrix X , which 
is defined as the largest singular value of X. 

Based on Lemma 1, we have the following proposition: 

Proposition 1. Given a scalar s, the convex hull of S = 
jx El mxn | T t T = s 2 / n j equals the spectral-norm ball 
with a radius of \s\: conv (S) = {Y E M mxn | ||X|| 2 < |s|}. 

Proof Since S = {X | Y = |s|X, X E Q}, we have 

( k k 

conv (*5) »<j °i Yi I V e <S, <9* > 0, ^ (9 = 1 

i= 1 i= 1 

k k 

\s\Xi I Xi€QA> O,E 0 = 1 

i= 1 i= 1 

= \s\ • conv (Q ). 



□ 


Consequently, the tightest convex relaxation to the con¬ 
straint in (10) is given by ||M^| 2 < |q|. 

Finally, with the modified shape model, the relaxed or¬ 
thogonality constraint and the assumption of sparse repre¬ 
sentation, the following optimization is proposed for shape 
recovery under the noiseless case: 


min I 

Cl,--- ,c k ,M u <~ ,Mk 

i=l 


i.t. W = MiBi, 


i= 1 


HMilla < |ci|, V*e[l,fc]. (11) 


In (11), reducing \cf\ can decrease the objective and the 
only constraint on \cf is that ||iLfi|| 2 < \cf. Therefore, \cf 
is equal to ||M^|| 2 when an optimum is attained. Conse¬ 
quently, (11) is equivalent to the following problem: 


k 

min V II Mi 

Mu-Mk 


s.t. W = J2 M iBi- 

i= 1 


( 12 ) 


The formulation in (12) is a linear inverse problem, where 
a set of orthogonal matrices 1 are estimated by minimizing 
their spectral norms. Interestingly, the conditions for exact 
recovery using such a convex program has been theoreti¬ 
cally analyzed in [4 ]. Numerical results will be presented 
in Section 5.1 to demonstrate the exact recovery property. 

To consider noise in real applications, the following 
formulation is proposed: 


mm - 

M u -,Mk 2 




i= 1 


a 


Eh m ' 


i\ 2 - 


(13) 




The problem (13) is the final formulation, which is a penal¬ 
ized least-squares problem. We have following remarks: 


1. The problem in (13) is convex programming, which 
can be solved globally. We will provide an efficient 
algorithm to solve it in Section 4.3. 

2. Notice that || • || 2 in the above formulations denotes 
the spectral norm of a matrix. As we will show in Sec¬ 
tion 4.2, minimizing the spectral norm of a matrix is 
equivalent to minimizing the ^oo-norm of the vector 
of singular values, which will simultaneously shrink 
the norm of the matrix towards zero and enforce its 
singular values to be equal. Therefore, by spectral- 
norm minimization, we can not only minimize the 
number of activated basis shapes but also enforce 
each transformation matrix Mi to be orthogonal (an 
orthogonal matrix has equal singular values). 

3. For the cases with missing or invisible landmarks, 
the model parameters can be estimated by minimiz¬ 
ing reprojection errors of observed landmarks, i.e., 
including a binary weight matrix in the first term of 
(13). The unobserved landmarks can be hallucinated 
from the reconstructed model as their locations are 
known on the basis shapes. 


1. For simplicity, we call a matrix X as orthogonal matrix if it satisfies 
X T X = si or XX T = si where s is a scalar. 
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Fig. 2. Illustration of the proximal operator of the spectral norm. 


4.2 Proximal operator of the spectral norm 

Before providing the specific algorithm to solve (13), we 
first prove the following proposition, which will serve as 
an important building block in our algorithm. 

Proposition 2. The solution to the following problem 

mmf\A-X\\ 2 F + X\\X\\ 2 (14) 

is given by X* =V\(A), where 

V X (A) = U A diag [a A - XV (l {ct a /X)\ V t a , (15) 

U a, V a and a a denote the left singular vectors , right singular 
vectors and the singular values of A, respectively. Vi x is the 
projection of a vector to the unit £\-norm ball 

Proof The problem in (14) is a proximal problem [4 ]. The 
proximal problem associated with a function F is defined as 

pr°x AF (A) = ai'g mill t ||A— X\\% + XF(X), (16) 

with the solution denoted by prox AF (A) and named the 
proximal operator of F. 

For the problem (14), F(X) = ||X|| 2 = ||crv||oo/ where 
|| • | |oo means the norm. It says that F is a spectral 
function operated on singular values of a matrix. Based on 
the property of spectral functions [46, Section 6.7.2], we have 

prox AF (A) = U A diag [prox A/ (<r A )j V T A , (17) 

where / is the ^-norm. The proximal operator of the 
^oo-norm can be computed by Moreau decomposition [46, 
Section 6.5]: 


pro x a/ (<ta) = <t A ~ XVe 1 (cr A /X), (18) 

given that the f?i-norm is the dual norm of the ^oo-norm. □ 

The solution for the case with two singular values is 
illustrated in Figure 2, where [Ai,A 2 ] t corresponds to 
(Ta — A'P^ 1 (<xa/A) in (15). From the illustration we have 
two observations: (1) If the projection is on the edge of 
the £i-norm ball, then Ai = A 2 and X* turns out to be 
orthogonal; (2) If a a is small and lies inside the f?i-norm 
ball, then the projection will be itself, Ai = A 2 =0, and X* 
turns out to be all-zero. These two observations illustrate the 
effects of spectral-norm regularization in (13): (1) Enforcing 
MiS to be orthogonal; (2) Pruning small ilT^s to achieve a 
sparse representation. 


4.3 Optimization 

The algorithm to solve (13) is presented here. The noiseless 
case (12) can be solved similarly. 

The proposed algorithm is based on ADMM [ z ] and 
the proximal operator of the spectral norm derived in Sec¬ 
tion 4.2. An auxiliary variable Z is introduced and (13) is 
rewritten as 


1 II -II 2 

min - \\W - ZB\\ 
m,z 2 II Wf 

s.t. M = Z, 


a 


Ell M ^> 


i= 1 


where 


M = [Mi 


M, 


B = 


B, 


(19) 


( 20 ) 


The augmented Lagrangian of (19) is 

1 2 & 

£„(M,Z,r) = - \\W-ZB\\ F + aJ2\\M. 


+ (Y,M-Z) + ||m--Z|| 


* II2 
2 


( 21 ) 


where Y is the dual variable and p is a parameter con¬ 
trolling the step size in optimization. Then, the ADMM 
alternates the following steps until convergence: 


*t +1 


M ' = arg min (M,Z t ,Y t '\ ; 

M V J 

Z t+1 = arg min( aT + \z, V) ; 


r 1 

Y t+i =y t +n (M +1 - Z‘ +1 ) . 
The subproblem in (22) can be rewritten as 
min £ /( (m, Z^Y*) 

M V J 


( 22 ) 

(23) 

(24) 


= mm - 
m 2 


M — Z + -Y 1 

p 


+ ^Eii M 




i 2 


= mm 

,Mj 


i=l 


2 ll M * - Q\ 


P 


}• 


(25) 


where Q\ is the i- th column-triplet of Z l — ^Y t . Therefore, 
each Mi can be updated separately by solving a proximal 
problem based on Proposition 2: 

Ml +1 =V f (Ql), Vie[l,fc]. (26) 

For the subproblem in (23), C t , (m + , Z. Y'j is a 
quadratic form and admits a closed-form solution: 

Z t+1 = (\VB T + nM t+1 + Y^ [bB T + m) 1 • (27) 


The steps are summarized in Algorithm 2. Following the 
standard proof in [4 ], it can be shown that the sequences of 
values produced by the iterations in Algorithm 2 converge 
to the optimal solution of the problem (19), which is also the 
optimal solution to the original problem (13). In implemen¬ 
tation, the convergence criterion and the adaptive scheme 
for tuning step size p from [47, Section 3] are adopted. 
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Input: W, a 
Output: M i, • • • ,Mk 

1 initialize Z — Y = 0, /z > 0; 

2 while not converged do 

3 parallel for i = 1 to k do 

4 Q\ = the z-th column-triplet of Z* — A V* ; 

5 M* +1 =D S (Q‘); 

6 end 

7 z t+1 = (Vi? T + /iM +1 + (bb t + jii) _1 

8 y-t+i = M (m‘ + 1 - Z t+1 J ; 

9 end 

Algorithm 2: ADMM to solve (13) 


4.4 Reconstruction 

After (13) is solved, two algorithms for reconstructing the 
3D shape from the estimated Mi s are proposed. 


4.4.1 Direct reconstruction 

The 3D shape is reconstructed according to the relaxed 
shape model (8), where q and Ri are recovered from the 
estimated Mi. The steps are summarized in Algorithm 3, 
where m\ and r\ denote the j-th row vectors of Mi 
and Ri, respectively. Note that q = — ||M ^||2 is also a 
feasible solution. To eliminate the ambiguity, we assume 
that Ci > 0 and impose this constraint when learning the 
shape dictionary. 


Input: M i, • • • , Mk 

Output: S 


i for i = 1 to k do 


Ci = || M 
(i) 

r) = m 


i 2/ 


rf^ = m 
r- 3) = r x r^ 2) ; 


M; 

M; 


_ r^(i) r (2) r (3)iT. 


^ = [ri 


7 end 

8 5 = X)i=i CiRiBi; 


Algorithm 3: Direct reconstruction 


4.4.2 Rotation synchronization and refinement 

The 3D shape is reconstructed according to the original 
shape model (3). In order to recover the coefficient vector 
c and a single rotation R from the estimated MiS, the 
following rotation synchronization problem is solved: 

k 

min V || Mi - CiR\\ 2 F , 

c,R “ 

s.t. RR T = I 2 - (28) 

The problem in (28) corresponds to the "metric projection" 
step in [34], [35], which can be exactly solved by semidefinite 
programming [34] or approximately by a more efficient 
algorithm [35, Section 4.4.1]. 

The solution from the synchronization described above 
is not necessarily the solution to the original problem (7), 
but it can be used as a good initialization and refined by 


the alternating minimization in Algorithm 1. In Section 5, 
it will be demonstrated that such a better initialization can 
clearly improve the performance of alternating minimiza¬ 
tion. A potential issue in the synchronization is that, when 
the rotations obtained from the convex program are very 
different from each other, the consensus of them might be 
meaningless and fails to provide a good initialization. 

The full steps are summarized in Algorithm 4. 


Input: Mi, • • • ,Mk 

Output: S 

1 initialize c and R by solving (28) using the algorithm 
in [35, Section 4.4.1]; 

2 refine c and R by Algorithm 1; 

3 S = J2i =i CiBi; 

Algorithm 4: Refinement and reconstruction 


4.5 Outlier modeling 


In real applications, the 2D correspondences are given by 
detectors and there are likely to be gross errors (outliers). In 
this case, we explicitly model the outliers by a sparse matrix 
E and modify the formulation in (13) as 


mm - 
,M k ,E,T 2 


W - MiBi - E - Tl 1 


7=1 




7=1 


2 


F 


(29) 


Note that in this case the translation T cannot be eliminated 
by centralizing the data at the beginning. 

The problem in (29) can be solved by ADMM as well. 
The algorithm is presented in Appendix A. 


4.6 Dictionary learning 

In general, one can simply use all existing or a few rep¬ 
resentative 3D models as the basis shapes. But when the 
collection is very large, e.g., a motion capture dataset often 
contains thousands of human skeletons, it is necessary to 
use the dictionary learning technique to learn a dictionary 
of basis shapes that can concisely summarize the variability 
in training data and enable a sparse representation. The 
dictionary learning problem can be formulated as follows: 

B , mi £ feC 12 l W S 3 -J2 C i3 Bi W 2 F + X W°h 

15 ’ 3 =1 i=l 

s.t. Cij >0, \\B,\\ f < 1, 

Vie[l,fc], je[l,n], (30) 

where Sj s denote the training shapes, BiS are the basis 
shapes to be learned, and represents the i -th coefficient 
for the j -th training shape. The two terms in the cost func¬ 
tion correspond to the reconstruction error and the sparsity 
of representation, respectively. The nonnegativity constraint 
is introduced to remove the ambiguity as discussed in Sec¬ 
tion 4.4.1. The problem in (30) is locally solved by alternately 
updating C and via projected gradient descent, an 
algorithm widely used in dictionary learning that converges 
to a local optimum [48]. The algorithm is summarized in 
Appendix B. 
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Cardinality of coefficients 

Fig. 3. The frequency of exact recovery. The intensity indicates the 
frequency of success under a variety of problem settings. 

5 Experiments 
5.1 Exact recovery 

We investigate whether the spectral-norm minimization in 
(12) can exactly solve the ill-posed inverse problem based 
on the prior knowledge of sparsity and orthogonality using 
simulation. 

More specifically, the data is synthesized with the fol¬ 
lowing model: 

k 

W = Y J M i B i , (31) 

i= 1 

where G R 3xp ( k = 50 and p is varying) 

are randomly generated bases with entries sampled in¬ 
dependently from the normal distribution A/"(0,1). Each 
Mi G M 2x3 is generated from Mi = CiRi, where c* is 
sampled from a uniform distribution ZY(0,1) and Ri denotes 
the first two rows of a rotation matrix with angles uniformly 
sampled from [0,27r]. Only a randomly selected subset of 
c i, • • • , Cfc are left as nonzero with a varying cardinality of 
z. Then, W and B are taken as input and MiS are estimated 
by solving (12). The recovery error is defined as 

relative error = ||M — M||i?/||M||i?, (32) 

where M^ are concatenated in M , and M is the algorithm 
estimate. A recovery is regarded as exact if the relative error 

< io- 3 . 

Figure 3 shows the frequency of exact recovery with 
varying p (number of landmarks) and 2 (cardinality of true 
coefficients), which is evaluated over 100 trials for each set¬ 
ting. Note that the number of unknowns (6k) is much larger 
than the number of equations (2 p). The proposed convex 
program can exactly solve the problem with a frequency 
equal to 1 in the lower-triangular area, where the number of 
landmarks is sufficiently large and the coefficients are truly 
sparse. This demonstrates the power of convex relaxation, 
which has proven to be successful in many inverse prob¬ 
lems, e.g., compressed sensing [49] and matrix completion 

]. The performance drops in more ill-posed cases in the 
upper-triangular area. This observation is analogous to the 
phase transition in compressive sensing, where the recovery 
probability also depends on the number of observations and 
the underlying signal sparsity [51]. 

Note that the original model (6) is a special case of the 
relaxed model (31) when M^ are the scaled versions of 
each other. The result obtained by using the original model 
for data generation is similar to Figure 3. 



5 10 15 20 25 

Number of activated bases 


Fig. 4. Representability of principal component analysis (PCA), sparse 
coding (SC) and nonnegative sparse coding (NNSC) for human poses. 

5.2 Human pose estimation 

The applicability of sparse representation for 3D human 
pose recovery has been thoroughly studied in previous 
work [ 8 ], [ 9 ], [1 ]. In this paper, we aim to illustrate the 
advantage of the proposed convex program compared to 
the alternating minimization scheme. 

Empirical evaluation is carried out on the CMU motion 
capture dataset [5^ ]. The dataset includes thousands of 
sequences of 3D human skeletons and part of them are 
selected for evaluation. More specifically, eight motion cate¬ 
gories are considered including walking, running, jumping, 
climbing, boxing, dancing, sitting and basketball. For each 
motion, six sequences are collected from different subjects 
with three sequences for training and the rest for testing. 
Then, a dictionary is learned with all training sequences 
from various motions and used to reconstruct the test se¬ 
quences. A 15-joint model (i.e., head, thorax, pelvis, shoul¬ 
ders, elbows, wrists, hips, knees, and ankles) is adopted to 
represent a human skeleton. 

To build the shape dictionary, all 3D poses in the training 
set are collected and aligned by the Procrustes method. 
The size of the dictionary k is set as 128 yielding an 
overcomplete dictionary since the dimension of the basis 
vector is 45 (15 joints in 3D). The dictionary is initialized 
by uniformly sampling k poses from the training data and 
updated by the algorithm described in Section 4.6. The 
representability of the learned dictionary is measured by the 
reconstruction error, i.e., the first term in (30). To evaluate 
the representability, (30) is solved multiple times with a 
variety of A yielding a sequence of reconstruction errors 
and the corresponding sparsity levels. The more general 
case without the nonnegativity constraint is also tested. The 
result is shown in Figure 4, which indicates that the sparse 
coding needs much fewer activated bases compared to PCA 
to achieve the same reconstruction error, and adding the 
nonnegativity constraint sacrifices the representability but 
the difference is small in our experiment. 

To generate 2D testing data, an orthographic camera 
rotating around the subject (360 degrees per sequence) is 
simulated and the 3D joints are projected to 2D with the 
simulated camera. The following methods are compared: 

• Convex: the proposed convex program with direct 
reconstruction (Algorithm 2+ Algorithm 3). 

• Convex+refine: the proposed convex program fol¬ 
lowed by refinement (Algorithm 2+ Algorithm 4). 
The local update of rotation is implemented by the 
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Objective value (Convex+refine) 
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Fig. 5. Quantitative results on the CMU motion capture dataset, (a) The mean 3D estimation errors for different motions, (b) The box plot of 
estimation errors, (c) The comparison between objective values achieved by the mean shape initialization and by the convex initialization, (d) The 
sensitivity to Gaussian noise, (e) The sensitivity to outliers. The length of error bar in (d) and (e) indicates half standard deviation. 


manifold optimization over the Stiefel manifold with 
the Manopt toolbox ]. 

• Altern: the alternating algorithm with the mean 
shape as initialization (Algorithm 1). The rotation is 
updated by SVD as in previous work [8], [1 ], [17]. 

• PMP: Projected Matching Pursuit from Ramakrishna 
et al. [ 8 ], which solves sparse coding by greedily 
selecting basis shapes instead of i\ minimization. 
The dictionary for PMP is constructed by the method 
provided in the original paper but with the same 
training data as ours. 

The estimation error is evaluated by the 3D Euclidian 
distances between the estimated joint locations and the 
ground truth in the camera frame up to a translation and 
scaling. The mean errors for various motions are shown in 
Figure 5(a) and the statistics of errors in Figure 5(b). The pro¬ 
posed convex method consistently outperforms the noncon- 
vex methods for all motions. Comparing "convex+refine" 
and "altern" which solve the same problem (7) at the end, 
the error is apparently decreased by using better initializa¬ 
tion (the convex solution). To further verify the fact that the 
alternating minimization depends on initialization, we com¬ 
pare the objective values achieved by the alternating algo¬ 
rithm initialized by the convex solution and the mean shape. 
As shown in Figure 5(c), the objective values achieved by 
convex initialization are always smaller. The mean errors 
of "convex" and "convex+refine" are comparable on the 


average. The rotation synchronization and refinement may 
lead to worse reconstructions if the convex solution fails 
to provide a good initialization or the optimal solution to 
(7) is not necessarily the best reconstruction, which may 
happen when the pose to be reconstructed cannot be well 
represented by the learned dictionary. 

Some visual examples of reconstructed poses for various 
motions are shown in Figure 6. All methods perform well in 
the "walk" and "run" examples, where the poses are close 
to the mean pose (stand upright). But for more complex 
motions, the nonconvex algorithms may be stuck at local 
optima as shown in other examples, while the proposed 
convex algorithm still obtains appealing results. A demon¬ 
stration video showing several reconstructed 3D pose se¬ 
quences is included in the supplementary material. The 
reconstructions by the convex method are much smoother 
over time compared to the ones by the nonconvex method 
which suffer from abrupt changes due to the inherent insta¬ 
bility of nonconvex optimization. 

The robustness of the proposed method against Gaussian 
noise is also tested. Gaussian noise with a varying standard 
deviation is added to the 2D input. The estimation error 
versus the standard deviation of the simulated noise is 
shown in Figure 5(d). The performances of the convex 
methods are consistently better than the alternating ones 
under all noise levels. To test the robustness of the extended 
model in (29) against outliers, a proportion of landmarks are 
selected with their 2D locations changed to be some random 
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Fig. 6. Qualitative results on the CMU motion capture dataset. The rows from top to bottom correspond to the input 2D poses, the ground-truth 
3D poses (visualized in a novel view), and the reconstructions from the proposed method, the alternating minimization, and the PMP method [ 8 ], 
respectively. Red and green indicate left and right, respectively. 


values within a proper range (roughly the rectangle that 
covers the skeleton). For comparison, the nonconvex model 
(7) is also extended by introducing an outlier term similar 
to the one in (29) and minimized by the alternating scheme. 
The curves of estimation error versus outlier proportion are 
shown in Figure 5(e). The robust models clearly outperform 
the original models and the convex method achieves a better 
performance than the alternating one. 

To illustrate the real applicability of the proposed 
method, we apply the 2D pose detector from Yang and 
Ramanan [5 ] on the PARSE dataset [55], and use the 
proposed method to lift the detected 2D poses to 3D with 
the dictionary learned on the CMU dataset. Some selected 
results are shown in Figure 7. 

5.3 Car model estimation 

The applicability of the proposed method for car model 
estimation is demonstrated using the Fine-Grained 3D Car 
(FG3DCar) dataset [1 ], which provides images of cars, 2D 
landmark annotations and some 3D models. The 3D models 
of 15 cars are concatenated as the shape dictionary to recon¬ 
struct the other cars from the visible landmarks annotated 
in the images (^-40 points per image). Several examples of 
reconstruction are shown in Figure 8. For comparison, the 
reconstructions from the algorithm proposed in the original 
paper [15] are also shown, which adopts a perspective 
camera model and locally updates the camera and shape 
parameters with a Jacobian system initialized by the mean 
shape and predefined camera parameters. The nonlinear 


optimization performs well in the sedan example but fails in 
the SUV and truck examples, where the models deviate far 
away from the mean shape. Similar results were reported in 
the original paper [15] and the authors proposed to use the 
class-specific mean shape for better initialization. Instead, 
the proposed method can achieve appealing results with 
arbitrary initialization. 

We observed that the alternating minimization in Algo¬ 
rithm 1 also performed very well for car model reconstruc¬ 
tion, as the shape variability of cars was relatively small 
and the mean-shape initialization was very close to the true 
solution. But when there are outliers in the observation, 
the initial estimate of viewpoint might be unreliable and 
the alternating algorithm may fail. To validate this, some 
outliers are simulated. More specifically, for each image 
20 annotated landmarks are randomly selected with their 
locations changed to be random values in the image plane. 
Then, the robust models are solved by the convex method 
and by the alternating algorithm initialized with the mean 
shape, respectively. Since there is no 3D ground truth (the 
3D models provided in the dataset were reconstructed by 
the authors), the 2D shape fitting error is evaluated, i.e., 
comparing the 2D projection of the reconstructed 3D model 
with the original 2D annotations, to see whether the synthe¬ 
sized outliers can be corrected. The comparison is shown in 
Figure 9. While the estimation errors are identical for most 
examples, there are still many examples in which the convex 
method performs apparently better than the alternating 
method. The same conclusion can be drawn by comparing 
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Fig. 7. Qualitative results on the PARSE dataset with 2D poses detected by an existing 2D pose detector [54]. In each example, the detected 2D 
pose superposed on the original image and the reconstruction in two different views are shown. The last row shows two failed examples. 


Image 


2D annotation 2D (convex) 


3D (convex) 


2D (nonlinear [15]) 



3D (nonlinear [15]) 












Fig. 8. Qualitative results on the FG3DCar dataset given 2D correspondences. The columns correspond to the original image, the input 2D 
landmarks, and the 2D/3D models output by the proposed method and the nonlinear optimization [15], respectively. Only visible landmarks (~ 40 
per image) are used for model fitting. The 3D models are visualized in a novel view different from the original image. The car models are the BMW 
5 Series 2011 (sedan), the Nissan Xterra 2005 (SUV) and the Dodge Ram 2003 (pick-up truck), respectively. 
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Error in pixel (convex+refine) Objective value (convex+refine) 


(a) (b) 

Fig. 9. Quantitative comparison between the proposed method (“con¬ 
vex+refine”) and the alternating minimization (“altern”) on the FG3DCar 
dataset with synthesized outliers, (a) The 2D shape error, (b) The 
objective value. 

the objective values. The mean 2D errors of the "altern", 
"convex" and "convex+refine" methods are 44.47, 37.21 and 
34.76 pixels, respectively. The rotation synchronization and 
refinement is beneficial here as the shape variability is small 
and a more constrained model is preferred. 

Finally, we demonstrate the real applicability of the 
proposed method with learned landmark detectors. For 
each landmark, the image patches around the landmark 
in training images are collected as positive examples and 
a number of random patches in the background as negative 
examples. A classifier based on SVM with HOG features is 
trained with the collected examples. To handle 2D appear¬ 
ance variability, the positive examples are clustered and a 
mixture of classifiers is learned for each landmark. During 
testing, a test image is scanned by the learned classifiers 
and the locations with maximum responses are picked as 
the landmark locations. The detections are shown in the 
first column of Figure 10. The false detections are mostly 
due to self occlusion. The visibility is unknown and we 
fit the model with all detected landmarks despite of their 
correctness. The 2D fitted models and 3D reconstructions are 
shown in the next two columns, which demonstrate that the 
proposed method can robustly fit the model at the presence 
of many outliers and without knowing the visibility. The last 
row shows two failed examples, in which more than half of 
the landmarks are incorrectly located. 

5.4 Parameter tuning 

The parameters a and (3 in the proposed model (13) and 
(29) control the sparsity of shape coefficients and outliers, 
respectively. In general, the best values of a and /? can 
be obtained by cross validation for a specific task. But 
empirically we find that the estimation error changes very 
smoothly with the parameters varying in a proper range 
and a set of fixed parameters works well after normalizing 
the data. More specifically, the coordinates of input 2D 
shape and 3D basis shapes are scaled such that the average 
variance of each shape along all directions is unit. With the 
data normalization, the estimation error as a function of 
parameters over a subset of CMU motion capture data is 
shown in Figure 11. Each curve has a relatively flat valley. 
In all experiments in this paper, we fix a = 1 and (3 = 0.1 
after the data normalization. 



p 



(a) (b) 

Fig. 11. Sensitivity to model parameters on the CMU motion capture 
dataset, (a) Estimation error of model (13) as a function of a. (b) 
Estimation error of model (29) as a function of (3 with a = 1 and 20% 
simulated outliers. 


5.5 Computational time 

The proposed algorithms are implemented in MATLAB and 
tested on a desktop with a Intel i7 3.4GHz CPU and 8G 
RAM. In our experiments, the ADMM algorithm generally 
converges within 500 iterations to reach a stopping criterion 
of 10 -4 . In human pose estimation, the average computa¬ 
tional time of our algorithm is 1.91s per frame, while those 
of the alternating algorithm and the PMP algorithm [ ] 
are 1.34s and 3.70s, respectively. Note that the for-loops in 
Algorithm 2 (Line 3) can be paralleled to further accelerate 
the computation. 


6 Discussion 


In summary, we proposed a method for aligning a 3D 
deformable shape model with a sparse representation to 2D 
correspondences by solving a convex program, which guar¬ 
antees global optimality. Intuitively, we adopted an aug¬ 
mented 3D shape space to achieve a linear representation of 
both shape and viewpoint variability and proposed to use 
the spectral-norm regularization to penalize invalid cases 
caused by the augmentation. We also extended our model 
to handle outliers in 2D correspondences. We empirically 
demonstrated the exact recovery property of the relaxed 
model as well as the advantage of convex optimization com¬ 
pared to alternative nonconvex methods especially when 
the shape variability was large and there were outliers. 

We demonstrated the applicability of the proposed 
method for reconstructing human poses and car models. 
The point correspondences across 3D training shapes are 
provided in the used datasets. With the increasing availabil¬ 
ity of techniques for 3D shape matching, e.g. [56], [57], and 
databases with rich annotations, e.g. PASCAL3D+ [58] and 
3D ShapeNet [^ ], we expect to see more applications of 
the proposed framework to reconstruct a variety of object 
categories in future. 
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2D detection 2D fitting 3D reconstruction 



2D detection 2D fitting 3D reconstruction 



Fig. 10. Qualitative results on the FG3DCar dataset with detected 2D landmarks. In each example, the detected landmarks superposed on the 
original image, 2D fitted model and 3D reconstruction are shown. The 2D landmarks are located by learned detectors based on HOG and SVM. 
The green and red dots correspond to the inliers and outliers, respectively. The outlier is defined as any detected landmark at least 30-pixel away 
from the manual annotation. Note that the inlier/outlier classification is only for the purpose of illustration and not for model fitting, i.e., the models 
are fitted with all landmarks despite of the correctness of detection. The last row shows two failed examples. 


Appendix A 

Algorithm to solve the robust model 


The algorithm to solve (29) is presented. The problem is 
rewritten as follows by introducing an auxiliary variable Z\ 

_min - \\w - ZB - E -T 1 t \\ 2 

M,Z,E,T 2 II IlF 

k 

+ aJ2\\M l \\ 2 +(3\\E\\ 1 , 

i= 1 

s.t. M = Z. (33) 


The augmented Lagrangian of (33) is 
(m, Z, E, T, r) =»T | W - ZB - E - Tl r |^ 

k 

+ a'£\\M i \\ 2 +/3\\E\\ 1 (34) 

i= 1 

+ (y,m - z) +-\\m - z\\ . 

\ / 2 11 IlF 


The following steps are iterated until convergence: 
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arg nun 


f '— t F1 

M 

,Z,& 


(36) 

E t+1 = 

arg min 

E 

Ad 

f~t+i 

M 

,z t+1 

,E,T t ,Y t )-, 

(37) 

fjit+l _ 

arg min 

T 


' — 1+ 1 

M 

,z t+1 , 

,E t+1 ,T,Y t '\ ; 

(38) 

yrt+1 _ 

V+/Z 

/•— 1 +1 

(m - 

z t+1 ) 


(39) 


The whole procedure is very similar to Algorithm 2. The 
differences are the steps to update E and T. For (35) 
and (36), they can be computed similarly as the steps in 
Algorithm 2: 

M* +1 =V*(Q\), Vie [l,k], (40) 

where Q\ is the i -th column-triplet of Z l — ^Y l — E l —T l 1 T , 

z t+1 = ((W -E*- T t l r )B r + nM t+1 + V) 

x (BB T + A) 1 • (41) 

For (37), the problem is a proximal problem associated with 
the £i-norm and can be solved by 

E t+l =Sp(w - Z t+1 B - r*l T ) , 


(42) 
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where Sp(X)ij = sign— f3)+ that refers to the el¬ 
ementwise soft-thresholding operator. For (38), the solution 
is simply given by 

T t+1 = mean of each row \w - Z t+1 B - £ ,+1 ] . (43) 

Note that there is no theoretical guarantee for the con¬ 
vergence of ADMM solving a multi-block problem such as 
(33) [60], though it always converges in our experiments. 


Appendix B 

Algorithm to solve dictionary learning 

The algorithm to solve the dictionary learning problem in 
(30) is presented. The cost function can be rewritten as: 

i k 

f(B,C) = - US,- -^CijBtWl + X^Cij, (44) 

i=l ij 

where B is the concatenation of BiS. Then, it is minimized 
by projected gradient descent and the algorithm is summa¬ 
rized in Algorithm 5. 
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16 

17 

18 


Input: S i,--- , S n 

Output: , Bk 

initialize B, C, step sizes Si and S^', 
while not converged do 

/* solve nonnegative sparse coding */ 
while not converged do 

Ci-C-Stfcf&C); 

for i = 1 to k do 
for j = 1 to n do 


if Cij 

I Q 


< 0 then 

7 ^— 0 


end 
end 
end 

/* update dictionary 

B ^ B — <5 2 V gf{B, C ); 

for i = 1 to k do 

if \\Bi 11> 1 then 

I Bi^Bi/WBiWp; 


* / 


end 


19 end 


Algorithm 5: Dictionary learning 
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